A Case for Abstract Cost Models for Distributed Execution of Analytics Operators
Authors
Abstract
We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. Finding a job partitioning that minimizes running time requires a cost model, which we more accurately call a makespan model. In search of the simplest such model that is still sufficiently accurate, we explore piecewise linear functions of input, output, and computational complexity. These models are abstract in the sense that they capture fundamental algorithm properties but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how this simplified functional structure can be exploited by integrating the model directly into the makespan optimization process, reducing complexity by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of cluster architectures.
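As a rough illustration of the kind of model the abstract describes, the sketch below evaluates a piecewise linear running-time function for each partition of a job and takes the maximum as the makespan. It is not the authors' model: the names `LinearPiece`, `task_time`, and `makespan`, the choice of input size as the variable that selects the active piece, and all coefficients are hypothetical assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LinearPiece:
    """One linear segment of a hypothetical per-task time model."""
    upper_input: float  # segment applies while input size <= upper_input
    b0: float           # fixed overhead (assumed unit: seconds)
    b_in: float         # cost per unit of input
    b_out: float        # cost per unit of output
    b_cpu: float        # cost per unit of computational work


def task_time(pieces: Sequence[LinearPiece],
              n_in: float, n_out: float, work: float) -> float:
    """Predicted time of one task: the first segment whose bound covers n_in."""
    for p in pieces:
        if n_in <= p.upper_input:
            return p.b0 + p.b_in * n_in + p.b_out * n_out + p.b_cpu * work
    raise ValueError("no segment covers this input size")


def makespan(pieces: Sequence[LinearPiece],
             partitions: Sequence[Tuple[float, float, float]]) -> float:
    """Makespan of a partitioning: the slowest task, assuming all run in parallel."""
    return max(task_time(pieces, *part) for part in partitions)


# Hypothetical coefficients: a cheaper segment for small inputs and a
# steeper one for large inputs (for example, once data no longer fits in memory).
PIECES = [
    LinearPiece(upper_input=1000.0, b0=1.0, b_in=0.01, b_out=0.02, b_cpu=0.001),
    LinearPiece(upper_input=float("inf"), b0=5.0, b_in=0.02, b_out=0.02, b_cpu=0.001),
]

# Each partition is (input size, output size, computational work).
balanced = [(1000.0, 100.0, 1000.0), (1000.0, 100.0, 1000.0)]
skewed = [(500.0, 100.0, 1000.0), (1500.0, 100.0, 1000.0)]

print(makespan(PIECES, balanced), makespan(PIECES, skewed))
```

Because each segment is linear in its arguments, comparing candidate partitionings only needs a few multiplications per task, which hints at why such a structure can be embedded directly in an optimizer; how the paper actually does this is not shown here.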
Similar resources
NScale: Neighborhood-centric Analytics on Large Graphs
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting in biological networks, finding social circles, personalized recommendations, link prediction, anomaly de...
Building Efficient and Cost-Effective Cloud-based Big Data Management Systems
Title of dissertation: BUILDING EFFICIENT AND COST-EFFECTIVE CLOUD-BASED BIG DATA MANAGEMENT SYSTEMS Abdul Hussain Quamar, Doctor of Philosophy, 2015 Dissertation directed by: Professor Amol Deshpande Department of Computer Science In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a pl...
Runtime Prediction for Scale-Out Data Analytics
Many analytics applications generate mixed workloads, i.e., workloads comprised of analytical tasks with different processing characteristics including data pre-processing, SQL, and iterative machine learning algorithms. Examples of such mixed workloads can be found in web data analysis, social media analysis, and graph analytics, where they are executed repetitively on large input datasets (e....
Afterburner: The Case for In-Browser Analytics
This paper explores the novel and unconventional idea of implementing an analytical RDBMS in pure JavaScript so that it runs completely inside a browser with no external dependencies. Our prototype, called Afterburner, generates compiled query plans that exploit typed arrays and asm.js, two relatively recent advances in JavaScript. On a few simple queries, we show that Afterburner achieves comp...
A New Bi-Objective Model for a Multi-Mode Resource-Constrained Project Scheduling Problem with Discounted Cash Flows and four Payment Models
The aim of a multi-mode resource-constrained project scheduling problem (MRCPSP) is to assign resource(s) with the restricted capacity to an execution mode of activities by considering relationship constraints, to achieve pre-determined objective(s). These goals vary with managers or decision makers of any organization who should determine suitable objective(s) considering organization strategi...